2 research outputs found

    Tracking System Behaviour from Resource Usage Data

    Full text link
    Resource usage data, collected using tools such as TACC Stats, capture the resource utilization by nodes within a high performance computing system. We present methods to analyze the resource usage data to understand the system performance and identify performance anomalies. The core idea is to model the data as a three-way tensor corresponding to the compute nodes, usage metrics, and time. Using the reconstruction error between the original tensor and the tensor reconstructed from a low rank tensor decomposition, as a scalar performance metric, enables us to monitor the performance of the system in an online fashion. This error statistic is then used for anomaly detection that relies on the assumption that the normal/routine behavior of the system can be captured using a low rank approx- imation of the original tensor. We evaluate the performance of the algorithm using information gathered from system logs and show that the performance anomalies identified by the proposed method correlates with critical errors reported in the system logs. Results are shown for data collected for 2013 from the Lonestar4 system at the Texas Advanced Computing Center (TACC

    dynamicMF: A Matrix Factorization Approach to Monitor Resource Usage in High Performance Computing Systems

    Full text link
    High performance computing (HPC) facilities consist of a large number of interconnected computing units (or nodes) that execute highly complex scientific simulations to support scientific research. Monitoring such facilities, in real-time, is essential to ensure that the system operates at peak efficiency. Such systems are typically monitored using a variety of measurement and log data which capture the state of the various components within the system at regular intervals of time. As modern HPC systems grow in capacity and complexity, the data produced by current resource monitoring tools is at a scale that it is no longer feasible to be visually monitored by analysts. We propose a method that transforms the multi-dimensional output of resource monitoring tools to a low dimensional representation that facilitates the understanding of the behavior of a High Performance Computing (HPC) system. The proposed method automatically extracts the low-dimensional signal in the data which can be used to track the system efficiency and identify performance anomalies. The method models the resource usage data as a three dimensional tensor (capturing resource usage of all compute nodes for difference resources over time). A dynamic matrix factorization algorithm, called dynamicMF, is proposed to extract a low-dimensional temporal signal for each node, which is subsequently fed into an anomaly detector. Results on resource usage data collected from the Lonestar 4 system at the Texas Advanced Computing Center show that the identified anomalies are correlated with actual anomalous events reported in the system log messages.Comment: 11 page
    corecore